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Figure 1: Example rendered images by the proposed system. All the images are rendered within one minute on GeForce GTX 680. The 
rendering process runs entirely on a GPU using only GLSL shaders. The system runs equally fine on many dijferent platforms while being 
capable of handling complex light transport paths. 


Abstract 

Ray tracing on GPUs is becoming quite common these days. There are many publicly available documents on how to implement 
basic ray tracing on GPUs for spheres and implicit surfaces. We even have some general frameworks for ray tracing on GPUs. 
We however hardly find details on how to implement more complex ray tracing algorithms themselves that are commonly used 
for photorealistic rendering. This paper explains an implementation of a stand-alone rendering system on GPUs which supports 
the bounding volume hierarchy and stochastic progressive photon mapping. The key characteristic of the system is that it uses 
only GLSL shaders without relying on any platform dependent feature. The system can thus run on many platforms that support 
OpenGL, making photorealistic rendering on GPUs widely accessible. This paper also sketches practical ideas for stackless 
traversal and pseudorandom number generation which both fit well with the limited system configuration. 

Categories and Subject Descriptors (according to ACM CCS): L3.7 [Computer Graphics]: Three-Dimensional Graphics and 
Realism—Raytracing 


1. Introduction 

The use of GPUs for ray tracing is becoming increasingly com¬ 
mon. In particular, details on how to implement a simple ray trac¬ 
ing system for a small number of objects or implicit surfaces are 
well documented in many publicly available tutorials. On the other 
hand, while there are several publicly available rendering systems 
on GPUs, implementation of such a more practical rendering sys¬ 
tem has been rarely documented. Nvidia’s OptiX [PBD*10] is one 
exception, yet OptiX itself is merely a general framework where 
we can implement various ray tracing algorithms on top of it. Even 
with the availability of such a general ray tracing framework, de¬ 
tails on implementations of specific ray tracing algorithms have to 
be sorted out. 

This paper explains an implementation of a rendering sys¬ 
tem which supports the bounding volume hierarchy [Wal07] 
and stochastic progressive photon mapping [HJ09] using only 
OpenGL 3.0 and GLSL 1.20. This limited system configuration 
was chosen for multiple practical reasons. Firstly, a program can 


reliably run on various operating systems and GPUs. While this 
platform independence of OpenGL is not perfect, it is mostly 
true for battle-tested versions such as the version 3.0. This is in 
contrast to vendor-specific implementations [PBD* 10, DKHS14] 
which are bound to be incompatible with GPUs of other vendors. 
Secondly, parallelization over multiple GPUs is automatically sup¬ 
ported without additional code. Unlike OpenGL, a graphics driver 
manages multiple GPUs with a general parallelization strategy. 
While this strategy can be suboptimal for each application, this sep¬ 
aration of management simplifies the implementation. Thirdly, a 
program can potentially run on web browsers via WebGL, since 
WebGL is essentially a limited version of OpenGL and becom¬ 
ing rapidly common. While WebGL [Khra] proposes support for 
more general GPU computation on any compatible web browser, 
no browsers currently support WebGL. There is no one to one cor¬ 
respondence between OpenGL and WebGL, but the proposed sys¬ 
tem uses mainly the features that are also available on WebGL. It 
is thus still making sense and practical to consider developing a 
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Figure 2: End-to-end performance on various GPUs on the metal 
Cornell box scene. The scene configuration is used by default in the 
released code. 

rendering system using a rather old version of OpenGL for general 
computation. 

This paper specifically sketches two practical ideas which are 
suitable for this limited system configuration. The first idea is a 
modification of the threaded bounding volume hierarchy [ST0O5] 
which improves the traversal performance by two to three times. 
The modification is to pre-sort all the nodes in a given threaded 
BVH along principal directions of rays and to store multiple in¬ 
stances of threaded BVHs. The modified traversal algorithm re¬ 
mains simple and its performance is on a practical level. The sec¬ 
ond idea is a pseudorandom number generator which uses only 
fioating-point numbers. This generator is computationally inexpen¬ 
sive while the quality of random numbers is sufficient. To summa¬ 
rize, the contributions are: 

• Open source GPU rendering system which uses limited features 
of OpenGL and thus is likely to run on WebGL. 

• Modification of the threaded BVH which allows an efficient and 
simple stackless traversal of a given BVH. 

• Introduction of a pseudorandom number generator which uses 
only fioating-point number operations. 

While similar work has been recently published by Davidovic et 
al. [DKHS14], their work focuses on a highly optimized implemen¬ 
tation on Nvidia’s GPUs using CUBA [NVI07]. The proposed sys¬ 
tem, on the other hand, is designed to be vendor independent. Since 
explaining all the details is not very informative, the following sec¬ 
tions outline only some high-level ideas. For more details, please 
refer to the released code. Figure 2 shows end-to-end performance 
on various GPUs. The code is available at http: / /www. ci . i . 
u-tokyo.ac.jp/~hachisuka/glslppm.zip as of May 
2015. 

2. Overview 

The proposed system supports stochastic progressive photon map¬ 
ping [HJ09] as the main rendering engine. Each iteration of 
stochastic progressive photon mapping consists of three stages. The 
first stage is photon tracing which samples light paths starting from 
light sources. The second stage traces eye paths from the camera 
until they hit non-specular surfaces. The last stage performs range 
queries at the intersection points of eye paths and executes stochas¬ 
tic progressive density estimation. The overall implementation is 
not done by merely porting existing algorithms to GLSL, but comes 
with several algorithmic modifications as noted later. 

Both the photon tracing stage and eye ray tracing stage need 
an efficient ray casting algorithm. While there exist many efficient 
ray casting algorithms [ALK12], many of them are not compatible 
with our limited system configuration. The proposed system thus 


node = cubemap(root_tex, ray.direction) ; 
while (node != null) { 

if (intersect(node.aabb, ray)) { 

if (node.leaf) result = inter sec t (node . triangles , ray); 

node = node.hit; 

} else { 

node = node . miss ; 

} 

} 

Figure 3: Traversal algorithm ofMTBVH. The only difference from 
the traversal algorithm of the original threaded BVH [ST0O5] is 
that it chooses an pre-ordered set of hit/miss links according to the 
ray direction (first line). 

employs the threaded BVH [ST0O5] (TBVH) to achieve a simple 
stackless traversal algorithm. The original algorithm unfortunately 
has major performance degradation due to the fixed traversal order. 
The proposed modification, the multiple-threaded BVH (MTBVH), 
alleviates this issue and improves the traversal performance by two 
to three times. 

Monte Carlo sampling of light paths and eye paths needs an ef¬ 
ficient and high quality random number generator. The main dif¬ 
ficulty is that our limited system configuration does not allow any 
native bitwise operations. Existing random number generators on 
GPUs rely on the availability of bitwise operations [TWOS] and 
thus are incompatible with our configuration. We introduce a ran¬ 
dom number generator that uses only floating-point number opera¬ 
tions. 

3. System components 
3.1. Multiple-Threaded BVH 

Since GLSL 1.20 (and GLSL in WebGL) does not support stack, 
we need a stackless traversal algorithm that runs efficiently on 
GPUs. On top of this practical reason, even if we could use 
stack, stackless traversal is known to have multiple advantages 
for massively parallel platforms [ASK 14] such as GPUs. Thread¬ 
ing [ST0O5] is one such approach to achieve stackless traversal 
of a given BVH. The threaded BVH stores hit/miss links instead 
of the tree structure to represent a given BVH. The traversal algo¬ 
rithm simply follows a hit or miss link depending on the result of 
the ray-AABB intersection test at each node. 

The proposed modification is to store six instances of hit/miss 
links by pre-sorting nodes along positive x axis, negative x axis, 
and so on for y and z as well. The traversal algorithm remains al¬ 
most the same, but it now selects one out of the six sets of links at 
the beginning, depending on the direction of a given ray. Eigure 3 
is the pseudocode of the traversal algorithm with the highlighted 
modification. This simple modification enables approximate opti¬ 
mization of the traversal order according to a given ray direction. 
Eigure 4 shows comparisons of ray traversal performance. While 
the multiple-threaded BVH is still not as fast as the vendor-specific 
optimization [ALK12], it is two to three times faster than original 
threaded BVH and remains vendor-independent. It should be em¬ 
phasized that the proposed algorithm is not fundamentally designed 
to achieve the best performance, but to achieve reasonable perfor¬ 
mance with only vendor-independent features. Please be aware that 
the provided code shows the number of complete paths per sec¬ 
ond including end-to-end rendering computation, not raw rays per 
second, thus the numbers will be different from those in Eigure 4. 
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Figure 4: Performance comparisons of ray casting using the 
vendor-specific optimized traversal [ALK12], the original threaded 
BVH [ST0O5] (TBVH), and the multiple-threaded BVH (MTBVH). 
The experiments used the SAH-BVH [Wal07] for all the algorithms 
and the computation times include diffuse shading. The testing en¬ 
vironment is GeForce GT 630. 

The implementation uses a standard top-down sweeping algo¬ 
rithm for constructing a BVH based on SAH [Wal07]. The con¬ 
struction and threading are both currently done on CPUs and the 
resulting data is transferred to GPUs afterward. Threading is usu¬ 
ally done less than 100 ms even for a typical scene and hardly be¬ 
comes the bottleneck. 

The storage overhead of MTBVH is not as significant as it ap¬ 
pears to be. For instance, if we count the number of VEC4s used 
in original threaded BVH, a triangle is stored as six VEC4s (two 
VEC4s for the packed position, normal, and texture coordinates for 
each vertex), an AABB is stored as two VEC4s (min and max), and 
hit/miss links can be packed into one VEC4. MTBVH adds only five 
more hit/miss links. Since the number of triangles and the number 
of nodes are typically not very different, MTBVH does not increase 
the total storage cost by six times, but by approximately 1.56 times 
(9 VEC4s of the original vs 14 VEC4s of ours). Since image textures 
usually add to the storage cost significantly more, the overhead of 
MTBVH is not significant. 

3.2. PRNG using only floating point numbers 

Since GLSL 1.20 does not support bitwise arithmetic opera¬ 
tions, we cannot port existing pseudorandom number generators 
(PRNGs) such as the one based on cryptographic hashing [TW08]. 
The system thus uses a weighted sum of multiplicative lin¬ 
ear congruential generators [L’e88] only with fioating-point 
number operations. Figure 5 shows the GLSL code of this 
PRNG. The original version of the algorithm was introduced 
as an anonymous post at the GPGPU web forum (http: 
//web.archive.org/web/20101217080108/http: 
//gpgpu.org/forums/viewtopic.php?t=25 91). The 
algorithm in Figure 5 has been modified to run well on GLSL. To 
generate many random numbers in parallel, one can initialize each 
PRNG state on a CPU via xorshift [M*03]. 

Note that the original algorithm [L’e88] uses integer arithmetic. 
While the authors have not observed any major artifacts due this 
change, changing the algorithm to use only fioating point numbers 


float GPURnd( inout vec4 state) { 

const vec4 q = vec4(1225, 1585, 2457, 2098); 

const vec4 r = vec4(1112, 367, 92, 265); 

const vec4 a = vec4(3423, 2646, 1707, 1999); 

const vec4 m = vec4 (4194287, 4194277, 4194191, 4194167); 

vec4 beta = floor(state / q); 

vec4 p = a * mod( state , q) — beta * r; 

beta = (sign(— p) + vec4(l)) * vec4(0.5) * m; 

state = (p + beta); 

return fract ( dot ( state / m, vec4(l, —1, 1, —1))); 

} 

Figure 5: Pseudorandom number generator using only fioating 
point number operations. The algorithm is based on a weighted 
sum of four instances of the multiplicative linear congruential gen¬ 
erator [L'e88]. 

might have introduced some statistical deficiency. The author of the 
above anonymous posting seems to claim certain quality of ran¬ 
domness within the 23 bit mantissa. 

3.3. Photon path regeneration 

A standard implementation of photon tracing keeps a global list of 
photons and sequentially adds a photon to the list. This approach 
however needs inter-threads commutation on GPUs to compact lists 
in order to obtain a complete global list of photons. The alternative 
approach used in the implementation is to trace a single bounce per 
pass and stores only one photon at most. For example, the very first 
pass traces photons from light sources toward the first intersections 
and stores the photons. Further bounces are traced only in succeed¬ 
ing passes. Each pass thus outputs at most one photon per thread, 
not a list of photons. 

A new photon path is generated at next pass if the current photon 
path is killed by Russian Roulette or misses a scene. This process is 
done independently between threads (in our case, threads are equal 
to pixels), thus each thread potentially traces a photon ray at a dif¬ 
ferent number of bounces. This approach keeps all threads busy all 
the time regardless of path length. A similar method is used for path 
tracing [NHDIO] and they reported performance improvement. We 
can observe similar improvement in the proposed system. 

3.4. Stochastic spatial hashing 

Hachisuka and Jensen [HJIO] proposed a stochastic hashing algo¬ 
rithm that utilizes the statistical nature of density estimation. They 
pointed two fundamental challenges in the use of regular spatial 
hashing on GPUs: the sequential nature of list constructions for 
hash table entries and uneven workload distribution at the data re¬ 
trieval phase. Their key idea is to let only one data survive for each 
hash table entry with concurrent writes, and to scale the contribu¬ 
tion of each photon by the number of hash collisions. 

The proposed modification is to assign a statistically independent 
random depth value (using the PRNG in Figure 5) for each photon 
path and to enable z-buffering for hashing photon data simultane¬ 
ously. Davidovic et al. [DKHS14] pointed that stochastic hashing 
can potentially produce wrong results if the timing that each photon 
data is hashed depends on some properties of the given photon path 
such as path length. This modification ensures that the probability 
that one photon survives over other photons is independent of any 
properties of the corresponding photon path. 
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4. Discussion 

All the shaders in the proposed system use features only of 
GLSL 1.20. They can thus run on WebGL which is based on 
OpenGL ES 2.0 and supports most features of GLSL 1.20. In prac¬ 
tice, however, they do not run on WebGL alone, since the offi¬ 
cial specification of WebGL does not guarantee a native support of 
some important features such as rendering to floating point num¬ 
ber textures. It should also be repeated that there is no one to one 
correspondence between a certain version of OpenGL and WebGL. 
Having said that, since WebGL can support floating point textures 
via the extension, it is most likely possible to port the code for We¬ 
bGL. WebGL 2 [Khrb] will natively support this extension. 

4.1. Limitations and future work 

While the code is intended to be platform-independent, it is pos¬ 
sible to fail due to the vendor-dependent JIT compilation model 
of GLSL. No fundamental modification, however, will be neces¬ 
sary to make the code successfully executable. The author would 
appreciate a bug report in case you found any. The current im¬ 
plementation does not build an acceleration data structure using 
GPUs. It is however trivial to implement linear BVH [LGS*09] 
using bitonic sorting [PDC*03] within our limited configuration, 
if one wants to build an acceleration data structure on GPUs. A 
more efficient construction for a high quality tree [GPMll] using 
only GLSL could however be challenging due the requirement of 
a more advanced thread management such as work queue. While 
stochastic progressive photon mapping covers many scene config¬ 
urations, one might want to implement more advanced rendering 
algorithms such as unified path sampling [HPJ12] / vertex merg¬ 
ing and connection [GKDS12]. Davidovic et al. [DKHS14] show 
how to implement some of such algorithms using CUDA, but port¬ 
ing their algorithms on GLSL may pose some challenges. The pro¬ 
posed system does not support out-of-core rendering. An efficient 
out-of-core rendering on GPUs is still an open problem even with 
more general GPU computation platforms. This feature, however, 
might not be necessary for some applications such as rendering for 
online shopping. 

5. Conclusion 

The paper outlines an implementation of a rendering system via 
OpenGL 3.0 and GLSL 1.20 without relying on any platform de¬ 
pendent feature. One of the proposed modifications, the multiple- 
threaded BVH, uses the principal direction of a given ray to select 
one of the threaded BVHs. This simple modification results in two 
to three times performance improvement compared to the original 
threaded BVH. Unlike common pseudorandom number generators, 
the proposed generator uses only fioating point number operations 
based on a combination of multiplicative linear congruential gener¬ 
ators. The resulting generator is computationally inexpensive and 
the quality of generated random numbers is enough for the pro¬ 
posed rendering system. The author believes that the code can serve 
as an example implementation of a rendering system using only 
platform independent features. 
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